1. Introduction

Gaussian processes have been adopted as building blocks for constructing prior 
distributions on infinite-dimensional statistical models in several settings. For 
instance, a sample path of a Gaussian process can be used directly as a prior 
model for a regression function (see e.g. [5], [22], [S]); after a monotonic trans- 
formation to the unit interval it can be used in the setting of classification (e.g. 
[16], [1], [5]); and after exponentiation and renormalization it becomes a model
for density estimation (e.g. [TT], Hj, [T^). 

Priors on functions of a single variable are commonly constructed using sta- 
tionary Gaussian processes with smooth sample paths (e.g. [T], [S], [TU], [IB])). 
A popular example is the so-called squared-exponential process, i.e. the centered
Gaussian process W with covariance function $EW_sW_t = a\exp(-b|t-s|^2)$ for
some a, b > 0. The existing mathematical literature concerned with priors of
this type focusses on computational issues (e.g. [16], [1], [10]) or posterior con- 
sistency ([5]). In the present paper we study posterior convergence rates, i.e. 
the rate at which the posterior distribution contracts around the true unknown 
functional parameter of interest. In particular, we are interested in exhibiting 
priors yielding optimal convergence rates if the unknown function belongs to a
smoothness class. 

It is well known that in the frequentist setup in which the data are sampled 
from a fixed true distribution, a prior on an infinite-dimensional model has to be 
carefully chosen in order to obtain optimal rates of contraction of the posterior 
(cf. e.g. [3], [17], [19], [5], [25], [23]). Even if posterior consistency is ensured,
the rate of contraction of the posterior around the true functional parameter 
will typically be suboptimal if the regularity of the prior process does not equal the
regularity of the unknown parameter. Since Gaussian processes like the squared 
exponential process have infinitely often differentiable sample paths (at least 
in mean square sense), they will, at least without modification, typically not 
be appropriate as a prior model for a functional parameter with some finite 
smoothness level, in the sense that they yield suboptimal contraction rates. 

In this paper we show however that this can be remedied by suitably rescaling 
the smooth process, with rescaling constants depending on the sample size. 
Given a fixed Gaussian process $(W_t : t \ge 0)$ indexed by the positive time axis
and scaling constants $c_n > 0$ we use the rescaled sample path

$$t \mapsto W_{t/c_n}, \qquad t \in [0,1], \tag{1.1}$$

as a prior model for a given function $w_0 : [0,1] \to \mathbb{R}$ that indexes the distribution
of the observations. The rescaling has the purpose of changing the appearance 
of the sample paths, so as to make them reflect more closely our prior ideas 
about the true parameter. Scaling factors $c_n \to \infty$ stretch the restrictions of the
sample paths $t \mapsto W_t$ to the time interval $[0, 1/c_n]$ to the interval [0, 1] and hence
use only a small part of the randomness in the Gaussian process. Typically, this
has the effect of smoothing the sample path. Conversely, scaling factors $c_n \to 0$
use the sample path $t \mapsto W_t$ on a long interval $[0, 1/c_n]$ and shrink this to
the interval [0, 1]. This typically makes the prior rougher, by incorporating the 
randomness of a longer time period. 
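
The effect of the scaling constant can be made concrete with a minimal sketch that evaluates the covariance of the rescaled path in (1.1). It assumes the squared-exponential covariance $EW_sW_t = \exp(-|t-s|^2)$; the function names `se_cov` and `rescaled_cov` are illustrative and do not come from the paper.

```python
import math

def se_cov(s: float, t: float) -> float:
    """Squared-exponential covariance E W_s W_t = exp(-|t - s|^2) (assumed kernel)."""
    return math.exp(-(t - s) ** 2)

def rescaled_cov(s: float, t: float, c: float) -> float:
    """Covariance of the rescaled path t -> W_{t/c}: the kernel evaluated at s/c and t/c."""
    return se_cov(s / c, t / c)

if __name__ == "__main__":
    lag = 0.1  # two time points in [0, 1] that are 0.1 apart
    for c in (10.0, 1.0, 0.1, 0.01):
        print(f"c = {c:>5}: correlation at lag {lag} = {rescaled_cov(0.0, lag, c):.6f}")
```

For large c, values at nearby time points stay almost perfectly correlated (smoother paths on [0, 1]); for small c the correlation at the same lag decays toward zero (rougher paths).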

Coming back to the example of the squared exponential process, we will 
show that for any given regularity level $\alpha > 0$ there exist scaling factors $c_n \to 0$
("roughening of the sample paths") such that after rescaling, we obtain a prior
yielding (up to logarithmic factors) optimal contraction rates if the true parameter
is $\alpha$-smooth. To prove this result we use the general theory on posterior
convergence rates for Gaussian priors developed in Van der Vaart and Van Zanten
[20]. The results of the latter paper state that the rate of convergence for a
Gaussian process prior W is determined by the behaviour of its concentration 
function 

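$$\varphi_{w_0}(\varepsilon) = \inf_{h \in \mathbb{H} :\, \|h - w_0\| < \varepsilon} \|h\|_{\mathbb{H}}^2 - \log \Pr(\|W\| < \varepsilon)$$

for $\varepsilon > 0$, where $\mathbb{H}$ is the reproducing kernel Hilbert space (RKHS) associated to
the process W, $\|\cdot\|_{\mathbb{H}}$ is the RKHS-norm, and $\|\cdot\|$ is the norm of the function space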
where W takes its values. More precisely, the results state that asymptotically, 
the posterior concentrates its mass on balls around the true parameter with 
radius of the order $\varepsilon_n$, where $\varepsilon_n$ is found by solving

$$\varphi_{w_0}(\varepsilon_n) \le n\varepsilon_n^2. \tag{1.2}$$

Our results for rescaled smooth, stationary Gaussian process priors are ob- 
tained by studying the RKHS and small deviations behaviour of such processes, 
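leading to upper bounds for their concentration functions. For W the centered
Gaussian process with covariance function $EW_sW_t = \exp(-|t-s|^2)$ and $c > 0$
we find that for the rescaled process $W^c = (W_{t/c} : 0 \le t \le 1)$, and $w_0$ a $C^\alpha$
function,

$$-\log\Pr\Big(\sup_{0 \le t \le 1} |W^c_t| \le 2\varepsilon\Big) \lesssim \frac{1}{c}\Big(\log\frac{1}{c\varepsilon}\Big)^2$$

and

$$\inf_{h \in \mathbb{H}^c :\, \|h - w_0\|_\infty \lesssim c^\alpha} \|h\|_{\mathbb{H}^c}^2 \lesssim \frac{1}{c}.$$

This implies that if we use rescaling rates $c_n \to 0$, (1.2) is solved for
$\varepsilon_n \gtrsim c_n^\alpha \vee (\log n)/\sqrt{n c_n}$. The optimal choice $c_n \sim (n/\log^2 n)^{-1/(1+2\alpha)}$ yields the rate
$\varepsilon_n \sim (n/\log^2 n)^{-\alpha/(1+2\alpha)}$. Up to a logarithmic factor, this is the well-known
minimax rate for estimating $\alpha$-regular functions.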
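
The balance behind the optimal scaling can be checked numerically. The sketch below assumes the rate bound $c^\alpha \vee (\log n)/\sqrt{nc}$ and the choice $c_n = (n/\log^2 n)^{-1/(1+2\alpha)}$ quoted above, with all constants set to one; the helper names are hypothetical.

```python
import math

def optimal_scaling(n: int, alpha: float) -> tuple:
    """Return (c_n, eps_n) with c_n = (n / log^2 n)^(-1/(1+2a)) and eps_n = (n / log^2 n)^(-a/(1+2a))."""
    base = n / math.log(n) ** 2
    c_n = base ** (-1.0 / (1.0 + 2.0 * alpha))
    eps_n = base ** (-alpha / (1.0 + 2.0 * alpha))
    return c_n, eps_n

def rate_bound(n: int, alpha: float, c: float) -> float:
    """The quantity c^alpha v (log n) / sqrt(n c) for a given scaling constant c."""
    return max(c ** alpha, math.log(n) / math.sqrt(n * c))

if __name__ == "__main__":
    n, alpha = 10**6, 1.5
    c_n, eps_n = optimal_scaling(n, alpha)
    # At c_n both terms in the maximum coincide; moving c either way increases the bound.
    print(c_n, eps_n, rate_bound(n, alpha, c_n))
    print(rate_bound(n, alpha, 2 * c_n), rate_bound(n, alpha, c_n / 2))
```

At $c_n$ the approximation term $c^\alpha$ and the small-ball term $(\log n)/\sqrt{nc}$ are equal, which is why neither a larger nor a smaller scaling constant improves the bound.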

In addition to smooth stationary prior processes we also consider self-similar 
processes. At first thought one might expect that rescaling would have no effect
on such processes, but this turns out to be false. Stretching and shrinking
cause the smoothing and roughening effects mentioned previously. Convergence
rates for posteriors based on certain self-similar Gaussian process priors were
obtained in [20]. We proved for instance that if W is a k-fold integrated Brownian
motion (plus an independent polynomial part), then using W as prior on
(k + 1/2)-smooth functions yields an optimal convergence rate for the posterior.
In this paper we show that after rescaling, this prior becomes appropriate for
a larger range of smoothness levels. For any $\alpha \in (0, k+1]$ there exist scaling
factors $c_n$ such that the prior based on the rescaled process $W_{t/c_n}$ yields optimal
contraction rates if the true parameter is $\alpha$-smooth. In this case we have $c_n \to 0$
("roughening") if $\alpha < k + 1/2$ and $c_n \to \infty$ ("smoothing") if $\alpha > k + 1/2$. The
range of $\alpha$'s for which rescaling the k-fold integrated Brownian motion leads to
a rate-optimal posterior is limited by the smoothness level k + 1 of the functions
in the RKHS of the process. Technically, the results for self-similar processes
are relatively easy consequences of the general results obtained in [20].

The results of this paper can be viewed as mathematical support for the 
common use of rescaled Gaussian process priors in Bayesian practice (see for 
instance [T], [TUj, [H]). We show that, from a frequentist perspective, rescaling 
greatly enlarges the range of models for which a given Gaussian process prior is 
appropriate. In a practical setting one often tries to robustify a Bayes procedure, 
or reduce subjectivity, by employing a random rescaling, i.e. using the prior
$W_{t/C}$ with a random scaling factor C independent of W, rather than the prior
$W_t$ itself. Further analysis is necessary on this issue, but the results in this paper
may serve as a starting point for such an investigation. 

Another point that deserves elaboration is the extension of our results to
multivariate settings, i.e. cases where the unknown function of interest is a
function of several variables. This requires, however, the generalization to higher
dimensions of a number of approximation results (cf. Lemmas 2.2 and 2.3),
which, in general, can be technically quite involved.

The remainder of this paper is organized as follows. In the next section we 
introduce and study the Gaussian processes that will serve as prior models. 
To prepare for the proofs of our main results on posterior convergence rates 
we obtain small deviation bounds and results on the approximation properties 
of the RKHS of rescaled smooth stationary processes and multiply integrated 
Brownian motions. The results on posterior contraction are precisely stated in 
the final Section 3.

The notation $\lesssim$ is used for "smaller than or equal to a universal constant
times", and $\asymp$ is "proportional up to constants". We use the notation C[0, 1]
for the space of continuous functions $f : [0,1] \to \mathbb{R}$, endowed with the uniform
norm, and $L_r(\mu)$ for the measurable functions $f : \mathbb{R} \to \mathbb{R}$ or $f : [0,1] \to \mathbb{R}$
with $\|f\|_r^r = \int |f|^r \, d\mu < \infty$. Furthermore, for $\beta > 0$ we let $C^\beta[0,1]$ denote
the Hölder space of order $\beta$, consisting of the functions $f \in C[0,1]$ that have
$\underline{\beta}$ continuous derivatives, for $\underline{\beta}$ the biggest integer strictly smaller than $\beta$, with
the $\underline{\beta}$th derivative $f^{(\underline{\beta})}$ being Lipschitz continuous of order $\beta - \underline{\beta}$. For $\varepsilon > 0$ let
$N(\varepsilon, B, d)$ be the minimum number of balls of radius $\varepsilon$ needed to cover a subset
B of a metric space with metric d.

2. Prior processes 

The theorems on posterior contraction rates that we present in the next section 
concern two classes of priors. The first are constructed by rescaling smooth, sta- 
tionary Gaussian processes, the second by rescaling multiply integrated Brown- 
ian motions. In this section we study these rescaled processes, obtaining results 
on their small deviations behaviour and the approximation properties of their 
reproducing kernel Hilbert spaces (RKHSs). Together with the general theory 
of [20], these results will allow us to obtain rates of convergence for posteriors.

2.1. Smooth stationary processes 

Consider a centered, mean-square continuous Gaussian process $W = (W_t : t \ge 0)$
with covariance function

$$EW_sW_t = \phi(s - t),$$

for a given continuous function $\phi : \mathbb{R} \to \mathbb{R}$. For a fixed scaling constant c > 0,
we define the rescaled version $W^c$ of the process W by setting $W^c_t = W_{t/c}$.

By Bochner's theorem the function $\phi$ is representable as the characteristic
function

$$\phi(t) = \int e^{-it\lambda} \, \mu(d\lambda) \tag{2.1}$$

of a symmetric, finite measure $\mu$ on $\mathbb{R}$, called the spectral measure of the process
W. (The minus sign in the exponent is for consistency in notation, but is
superfluous as $\mu$ is symmetric.) We shall consider spectral measures satisfying
the condition

$$\int e^{\delta|\lambda|} \, \mu(d\lambda) < \infty \tag{2.2}$$



for some $\delta > 0$. This condition on the tails of the spectral measure should be
viewed as a smoothness condition on W. It implies for instance that the sample
paths of W are infinitely often differentiable, at least in mean-square sense.
Examples of processes satisfying (2.2) are the centered Gaussian processes with
covariance functions $(s,t) \mapsto \exp(-|t-s|^2)$ or $(s,t) \mapsto (1 + |t-s|^2)^{-1}$, which
correspond, respectively, to Gaussian and Laplace spectral measures. Observe
that in particular (2.2) implies that $\mu$ has finite moments of all orders and hence,
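since

$$E(W_s - W_t)^2 = 2\int \big(1 - \cos(\lambda(s-t))\big) \, \mu(d\lambda) \le |s-t|^2 \int \lambda^2 \, \mu(d\lambda),$$

the process W admits a continuous version. We shall work with this continuous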
version throughout. 

The spectral measure $\mu_c$ of the rescaled process $W^c$ is obtained by rescaling
$\mu$:

$$\mu_c(B) = \mu(cB).$$

Let $L_2(\mu_c)$ be the set of all functions $h : \mathbb{R} \to \mathbb{C}$ whose modulus $|h|$ is square
integrable with respect to $\mu_c$. Denote by $\mathcal{F}_c h$ the transform $\mathcal{F}_c h : \mathbb{R} \to \mathbb{C}$ of the
function h relative to the measure $\mu_c$:

$$(\mathcal{F}_c h)(t) = \int e^{-it\lambda} h(\lambda) \, d\mu_c(\lambda).$$

Note that $\mathcal{F}_c$ maps $L_2(\mu_c)$ into the space $C(\mathbb{R})$ of continuous functions on the
real line.
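
The rescaling rule $\mu_c(B) = \mu(cB)$ can be sanity-checked numerically. The following sketch assumes the covariance $\phi(t) = \exp(-t^2)$, whose spectral density under the convention (2.1) is $\exp(-\lambda^2/4)/(2\sqrt{\pi})$, and verifies by a trapezoidal rule that integrating $e^{-it\lambda}$ against $\mu_c$ recovers $\phi(t/c)$. The function names and quadrature settings are illustrative choices, not part of the paper.

```python
import math

def spectral_density(lam: float) -> float:
    """Density of mu for the covariance exp(-t^2): exp(-lam^2 / 4) / (2 sqrt(pi))."""
    return math.exp(-lam * lam / 4.0) / (2.0 * math.sqrt(math.pi))

def rescaled_density(lam: float, c: float) -> float:
    """Density of mu_c, where mu_c(B) = mu(cB): c * f(c * lam)."""
    return c * spectral_density(c * lam)

def cov_from_spectrum(t: float, c: float, half_width: float = 40.0, steps: int = 20000) -> float:
    """Trapezoidal approximation of int cos(lam * t) d mu_c(lam) over |c * lam| <= half_width.

    Because mu_c is symmetric, the imaginary part of int exp(-i t lam) d mu_c vanishes.
    """
    a, b = -half_width / c, half_width / c
    h = (b - a) / steps
    total = 0.0
    for i in range(steps + 1):
        lam = a + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * math.cos(lam * t) * rescaled_density(lam, c)
    return total * h

if __name__ == "__main__":
    for t, c in [(0.3, 1.0), (0.3, 0.5), (0.1, 0.2)]:
        print(t, c, cov_from_spectrum(t, c), math.exp(-(t / c) ** 2))
```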

The following lemma describes the RKHS of the process $(W^c_t : t \in [0,1])$.
Recall that this space is defined as the completion of the linear span of the
functions $h_s$ defined by $h_s(t) = EW^c_sW^c_t$, with $s \in [0,1]$, under the inner product

$$\langle h_s, h_t \rangle_{\mathbb{H}^c} = EW^c_sW^c_t.$$

Lemma 2.1. Under condition (2.2) the reproducing kernel Hilbert space of the
process $(W^c_t : 0 \le t \le 1)$ (viewed as a map in C[0,1]) is the set of real parts
of all transforms $\mathcal{F}_c h$ (restricted to the interval [0, 1]) of functions $h \in L_2(\mu_c)$,
equipped with the square norm

$$\|\mathcal{F}_c h\|_{\mathbb{H}^c}^2 = \int |h|^2 \, d\mu_c.$$

Proof. Although the RKHS is real by definition, it will be convenient to complete
the linear span of the functions $h_s$ over the complex numbers. Because the
functions $h_s$ are real-valued, the RKHS is the set of real parts of functions in
this complex RKHS.

If $e_s : \mathbb{R} \to \mathbb{C}$ denotes the function $e_s(\lambda) = e^{i\lambda s}$, then, by the definition of
the spectral measure,

$$\langle h_s, h_t \rangle_{\mathbb{H}^c} = EW^c_sW^c_t = \int e_s \bar{e}_t \, d\mu_c = \langle e_s, e_t \rangle_{L_2(\mu_c)}.$$

It follows that the linear extension of the map $L : h_s \mapsto e_s$ is an isometry for
the Hilbert space structures from $\mathrm{lin}(h_s : s \in [0,1]) \subset \mathbb{H}^c$ onto
$\mathrm{lin}(e_s : s \in [0,1]) \subset L_2(\mu_c)$. Hence, $\mathbb{H}^c$ is the inverse image under L of the closure of the
set of functions $\mathrm{lin}(e_s : s \in [0,1])$ in $L_2(\mu_c)$. Now, again by the definition of the
spectral measure,

$$(L^{-1}e_s)(t) = h_s(t) = EW^c_sW^c_t = \phi((s-t)/c) = (\mathcal{F}_c e_s)(t).$$

It follows that the inverse $L^{-1}$ is exactly the transform $\mathcal{F}_c$.

Finally we prove that $\mathrm{lin}(e_s : s \in [0,1])$ is dense in $L_2(\mu_c)$, using the condition
(2.2). As $s \downarrow 0$, by dominated convergence,

$$\frac{(e_s - e_0)(\lambda)}{s} \to i\lambda,$$

as functions in $L_2(\mu_c)$. Because the functions on the left side are in $\mathrm{lin}(e_s : s \in [0,1])$,
the function $\lambda \mapsto i\lambda$ is in the closure of this set. We repeat this
argument with the function $(e_s - e_0)(\lambda)/s - i\lambda$ to see that the function $\lambda \mapsto \tfrac12(i\lambda)^2$
is contained in the closure of $\mathrm{lin}(e_s : s \in [0,1])$, and so on. We conclude that all
polynomials $\lambda \mapsto \lambda^k$ are contained in this closure. By extension to the complex
case of Proposition 6.4.1 of [15] and (2.2), it then follows that this closure is
dense in the full space $L_2(\mu_c)$. □

Let $\phi_c(x) = \phi(x/c)$ and denote by $\phi_c * G(t) = \int \phi_c(t - s) \, dG(s)$ the density
of the convolution of a signed measure G and the distribution corresponding to
the density $\phi_c$. By Fubini's theorem, such a convolution can be written as

$$\phi_c * G(t) = 2\pi (\mathcal{F}_c \hat{G})(t),$$

for $\hat{G}$ the characteristic function of G, defined by $2\pi\hat{G}(\lambda) = \int e^{it\lambda} \, dG(t)$. Because
$2\pi|\hat{G}|$ is uniformly bounded by the total variation of G, it is contained in $L_2(\mu_c)$
and hence the function $\phi_c * G$ is contained in the RKHS, with square norm

$$\|\phi_c * G\|_{\mathbb{H}^c}^2 = (2\pi)^2 \int |\hat{G}|^2 \, d\mu_c.$$

For a measure G on the interval [0, 1] this follows readily from the definition
of the RKHS as the linear space spanned by the functions $t \mapsto EW^c_sW^c_t =
\phi_c(s - t) = \phi_c * \delta_s(t)$. As shown by the preceding lemma, under condition (2.2),
the functions $\phi_c * G$ are contained in the RKHS for any signed measure G on
the full line $\mathbb{R}$. This will be important for the proof of the following lemma,
which quantifies how well functions can be approximated by elements of the
RKHS of the rescaled process $W^c$.
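
Lemma 2.2. Let $\mu$ satisfy (2.2) and possess a Lebesgue density that is bounded
away from zero on a neighborhood of 0. Let $\beta > 0$ be given. Then for any
$w \in C^\beta[0,1]$ there exist constants $C_w$ and $D_w$ depending only on w such that,
as $c \downarrow 0$,

$$\inf\{\|h\|_{\mathbb{H}^c}^2 : \|h - w\|_\infty \le C_w c^\beta\} \le D_w \frac{1}{c}.$$

Proof. Let $\underline{\beta}$ be the biggest integer strictly smaller than $\beta$. Let $\hat{\phi}$ be the density
of $\mu$, and let $\hat{\psi}(\lambda) = (2\pi)^{-1} \int e^{it\lambda} \psi(t) \, dt$ be the Fourier transform of a general
function $\psi : \mathbb{R} \to \mathbb{C}$. The Fourier transform of the function $\psi_c$ defined by
$\psi_c(x) = \psi(x/c)$ is given by $\hat{\psi}_c(\lambda) = c\hat{\psi}(c\lambda)$.

There exists a symmetric, integrable function $\psi : \mathbb{R} \to \mathbb{R}$ with $\int \psi(t) \, dt = 1$,
$\int t^k \psi(t) \, dt = 0$ for every $k = 1, \ldots, \underline{\beta}$ and $\int |t|^\beta |\psi(t)| \, dt < \infty$, and such that
the function $|\hat{\psi}|^2/\hat{\phi}$ is bounded. Take for instance a function $\psi$ with compactly
supported, symmetric, real-valued Fourier transform $\hat{\psi}$ which equals $1/(2\pi)$ in
a neighborhood of zero, so that

$$\frac{1}{2\pi}\int (it)^k \psi(t) \, dt = \hat{\psi}^{(k)}(0) = \begin{cases} 0, & k \ge 1, \\ \dfrac{1}{2\pi}, & k = 0. \end{cases}$$

We can extend $w : [0,1] \to \mathbb{R}$ to a function $w : \mathbb{R} \to \mathbb{R}$ with compact support
and $\|w\|_\beta < \infty$.

By Taylor's theorem we can write, for $s, t \in \mathbb{R}$,

$$w(t + s) = \sum_{j=0}^{\underline{\beta}} w^{(j)}(t) \frac{s^j}{j!} + S(t, s),$$

where, for some $\xi \in [0,1]$,

$$|S(t,s)| = \Big| \frac{s^{\underline{\beta}}}{\underline{\beta}!} \big( w^{(\underline{\beta})}(t + \xi s) - w^{(\underline{\beta})}(t) \big) \Big| \le \frac{1}{\underline{\beta}!} \|w\|_\beta |s|^\beta.$$

In view of the assumption that $\psi$ is a higher order kernel, for any $t \in \mathbb{R}$,

$$\frac{1}{c} \psi_c * w(t) - w(t) = \int \psi(s) \big( w(t - sc) - w(t) \big) \, ds = \int \psi(s) S(t, -cs) \, ds.$$

Combining the preceding displays shows that $\|c^{-1}\psi_c * w - w\|_\infty \le \|w\|_\beta K c^\beta / \underline{\beta}!$
for $K = \int |s|^\beta |\psi|(s) \, ds$.

For $\hat{w}$ the Fourier transform of w, we can write

$$(w * \psi_c)(t) = 2\pi \int e^{-it\lambda} \hat{w}(\lambda) \hat{\psi}_c(\lambda) \, d\lambda = 2\pi \mathcal{F}_c\Big( \frac{\hat{w}\hat{\psi}_c}{\hat{\phi}_c} \Big)(t),$$

where $\hat{\phi}_c(\lambda) = c\hat{\phi}(c\lambda)$ is the density of $\mu_c$.
It follows that the function $c^{-1}\psi_c * w$ is contained in the RKHS, with square
norm

$$\Big\| \frac{1}{c} w * \psi_c \Big\|_{\mathbb{H}^c}^2 = \frac{(2\pi)^2}{c^2} \int \Big| \frac{\hat{w}\hat{\psi}_c}{\hat{\phi}_c} \Big|^2 d\mu_c = \frac{(2\pi)^2}{c^2} \int \frac{|\hat{w}\hat{\psi}_c|^2}{\hat{\phi}_c} \, d\lambda \le \frac{(2\pi)^2}{c} \int |\hat{w}(\lambda)|^2 \, d\lambda \, \Big\| \frac{|\hat{\psi}|^2}{\hat{\phi}} \Big\|_\infty.$$

Here $\int |\hat{w}(\lambda)|^2 \, d\lambda = (2\pi)^{-1} \int w^2(t) \, dt$ is finite. □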

Lemma 2.1 implies that under (2.2) the elements of the RKHS can be continuously
extended to functions that are analytic on the strip $\{t \in \mathbb{C} : |\mathrm{Im}\, t| <
(c\delta)/2\}$ in the complex plane. The following entropy estimate is therefore related
to classical results on the entropy of spaces of analytic functions, obtained by
Kolmogorov and Tihomirov [7]. They obtain estimates of the order $(\log(1/\varepsilon))^2$
for the entropy of the spaces we are interested in. The following lemma makes
the dependence on the scaling constant c explicit, which is essential for the proof
of our main results.



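Lemma 2.3. Assume that the spectral measure satisfies (2.2) for some $\delta > 0$.
Then the entropy of the unit ball $\mathbb{H}^c_1$ of the RKHS of the process $W^c = (W^c_t :
0 \le t \le 1)$ (viewed as a map in C[0, 1]) satisfies

$$\log N(\varepsilon, \mathbb{H}^c_1, \|\cdot\|_\infty) \lesssim \frac{1}{c}\Big(\log\frac{1}{\varepsilon}\Big)^2.$$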

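Proof. We construct an $\varepsilon$-net of piecewise polynomials over $\mathbb{H}^c_1$.

Because all moments of the spectral measure $\mu_c$ are finite by (2.2), we can
for any $h \in L_2(\mu_c)$ differentiate the function $\mathcal{F}_c h$ under the integral sign to find
that $(\mathcal{F}_c h)^{(k)}(t) = \int (-i\lambda)^k e^{-it\lambda} h(\lambda) \, d\mu_c(\lambda)$. Consequently,

$$\|(\mathcal{F}_c h)^{(k)}\|_\infty^2 \le \Big( \int |\lambda|^k |h(\lambda)| \, d\mu_c(\lambda) \Big)^2 \le \int |h(\lambda)|^2 \, d\mu_c(\lambda) \int |\lambda|^{2k} \, d\mu_c(\lambda) = \|\mathcal{F}_c h\|_{\mathbb{H}^c}^2 \frac{a_{2k}}{c^{2k}},$$

where $a_k$ are the absolute moments of the spectral measure $\mu$. By Taylor's
formula it follows that, for every $s, t \in [0,1]$,

$$(\mathcal{F}_c h)(t + s) = \sum_{j=0}^{k-1} (\mathcal{F}_c h)^{(j)}(t) \frac{s^j}{j!} + (R_k h)(t, s),$$

with remainder satisfying

$$|(R_k h)(t,s)| \le \frac{|s|^k}{k!} \|(\mathcal{F}_c h)^{(k)}\|_\infty \le \frac{|s|^k}{k!} \frac{\sqrt{a_{2k}}}{c^k},$$

for $\mathcal{F}_c h \in \mathbb{H}^c_1$.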
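
For given $\varepsilon, d > 0$ choose $k \in \mathbb{N}$ such that $\sqrt{a_{2k}} (d/c)^k / k! < \varepsilon$. Consider the set $\mathcal{H}$
of functions

$$h(t) = \sum_{i=1}^{\lceil 1/d \rceil} 1_{((i-1)d, \, id]}(t) \sum_{j=0}^{k-1} \gamma_{i,j} \frac{(t - id)^j}{j!}, \tag{2.3}$$

where $\gamma_{i,j}$ ranges over the grid $\{0, \pm\eta_j, \pm 2\eta_j, \ldots\}$ intersected with the interval
$[-\sqrt{a_{2j}}/c^j, \sqrt{a_{2j}}/c^j]$, for $\eta_j = \varepsilon j!/(d^j k)$. For every function $\mathcal{F}_c h \in \mathbb{H}^c_1$ and i
there exist points $\gamma_{i,j}$ in the grid such that

$$\sup_{(i-1)d < t \le id} \big| (\mathcal{F}_c h)^{(j)}(id) - \gamma_{i,j} \big| \frac{|t - id|^j}{j!} \le \eta_j \frac{d^j}{j!} = \frac{\varepsilon}{k}.$$

The function $\sum_{j=0}^{k-1} (\mathcal{F}_c h)^{(j)}(id)(t - id)^j / j!$ is within uniform distance $\varepsilon$ of the
function $\mathcal{F}_c h$ on the interval $((i-1)d, id]$. The preceding being true for every i
implies that the set $\mathcal{H}$ of piecewise polynomials (2.3) forms a $2\varepsilon$-net over $\mathbb{H}^c_1$ for
the uniform norm on (0, 1], and hence the covering number $N(2\varepsilon, \mathbb{H}^c_1, \|\cdot\|_\infty)$ is
bounded by the number of points in $\mathcal{H}$, which is equal to the number of different
matrices $(\gamma_{i,j})$. The logarithm of this number can be bounded as

$$\log \#\mathcal{H} \le \log \prod_{i=1}^{\lceil 1/d \rceil} \prod_{j=0}^{k-1} \Big( \frac{2\sqrt{a_{2j}}/c^j}{\eta_j} + 1 \Big) \lesssim \frac{1}{d} \sum_{j=0}^{k-1} \log\Big( 1 + \frac{2k\sqrt{a_{2j}}}{\varepsilon \, j!} \Big(\frac{d}{c}\Big)^j \Big).$$

For any $x > 0$ and $j > 0$ we have $x^j \le e^x \Gamma(j+1)$. Indeed,

$$e^x \Gamma(j+1) = \int_0^\infty e^{-(s-x)} s^j \, ds = \int_{-x}^\infty e^{-s}(s + x)^j \, ds \ge \int_0^\infty e^{-s} x^j \, ds.$$

Therefore, for any $\lambda$ we have $|\lambda|^{2j} \le \delta^{-2j}\Gamma(2j+1)e^{\delta|\lambda|}$ and, consequently, with
$K = \int e^{\delta|\lambda|} \, d\mu(\lambda)$,

$$\frac{\sqrt{a_{2j}}}{j!} \Big(\frac{d}{c}\Big)^j \le \frac{\sqrt{K\Gamma(2j+1)}}{\delta^j j!} \Big(\frac{d}{c}\Big)^j.$$

In view of Stirling's approximation $\Gamma(n+1) \asymp n^{n+1/2}e^{-n}$, the right-hand side
is, up to a constant, for $j \ge 1$, equivalent to

$$\frac{\sqrt{K}(2j)^{j+1/4}e^{-j}(1 + o(1))}{j^{j+1/2}e^{-j}(1 + o(1))} \Big(\frac{d}{\delta c}\Big)^j \lesssim \Big(\frac{2d}{\delta c}\Big)^j.$$

We choose $d < \delta c/2$ and $k \sim \log(1/\varepsilon)$ to reduce this expression for $j = k$ to a
number smaller than $\varepsilon$. We have that $(d/c)^j \sqrt{a_{2j}}/j!$ is bounded above uniformly
in $j = 1, \ldots, k-1$, and hence

$$\log \#\mathcal{H} \lesssim \frac{1}{d} k \log\frac{k}{\varepsilon}.$$

With the indicated choices of d and k, this yields the bound given in the statement of
the lemma. □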
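
The parameter choices at the end of the proof can be illustrated with a small sketch. It assumes the concrete (and arbitrary) choice $d = \delta c/4$, so that $2d/(\delta c) = 1/2$, and ignores all constants hidden in $\lesssim$; the function names are hypothetical.

```python
import math

def net_parameters(eps: float, c: float, delta: float) -> tuple:
    """Pick d = delta * c / 4 and an integer k ~ log(1/eps) with (2d / (delta c))^k <= eps."""
    d = delta * c / 4.0
    k = max(1, math.ceil(math.log2(1.0 / eps)))
    return d, k

def log_net_size_order(eps: float, c: float, delta: float) -> float:
    """Order of the entropy bound (1/d) * k * log(k / eps), with constants ignored."""
    d, k = net_parameters(eps, c, delta)
    return k * math.log(k / eps) / d

if __name__ == "__main__":
    for c in (1.0, 0.5, 0.25):
        print(c, net_parameters(1e-3, c, 1.0), log_net_size_order(1e-3, c, 1.0))
```

Since k depends only on $\varepsilon$ and d is proportional to c, halving c doubles the bound, which is the $1/c$ factor in Lemma 2.3; the factor $k \log(k/\varepsilon)$ is of order $(\log(1/\varepsilon))^2$ up to a $\log\log$ term.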

If the spectral measure satisfies the stronger tail condition
$\int \exp(\delta|\lambda|^r) \, \mu(d\lambda) < \infty$ for some $\delta > 0$ and $r > 1$, the elements of the
RKHS can be extended to the entire complex plane and satisfy an exponential
type restriction. In that case the results of Section 7.4 of [7] apply and can
be used to improve the power 2 appearing in the entropy bound given by the
lemma. In the statistical results, this improves the power of the logarithmic
factors that we have in the rate of contraction results.

The following is a consequence of the preceding lemma and the well-known
connection between the entropy of the unit ball of the RKHS and small ball
probabilities, cf. Kuelbs and Li [8], Li and Linde [14].

Theorem 2.4. Suppose the spectral measure satisfies (2.2) and $c < 1$. Then
there exists an $\varepsilon_0 > 0$, independent of c, such that the rescaled process $W^c$
satisfies

$$-\log\Pr\Big(\sup_{0 \le t \le 1} |W^c_t| \le 2\varepsilon\Big) \lesssim \frac{1}{c}\Big(\log\frac{1}{c\varepsilon}\Big)^2$$

for $\varepsilon \in (0, \varepsilon_0)$.

Proof. Let $\varphi_c(\varepsilon) = -\log\Pr(\sup_{0 \le t \le 1} |W^c_t| \le \varepsilon)$ be the quantity of interest. The
preceding lemma and Theorem 1.2 of [14] imply we have the crude bound

$$\varphi_c(\varepsilon) \lesssim c^{-2/(2-\alpha)} \varepsilon^{-2\alpha/(2-\alpha)} \tag{2.4}$$

for every $\alpha \in (0, 2)$. According to the proof of Theorem 1.2 and the related
Proposition 3.1 of [14], this bound holds for all $\varepsilon > 0$ satisfying

$$c \le (\varphi_c(\varepsilon/2))^{\alpha/2} \varepsilon^{-\alpha}.$$

We have $c < 1$ by assumption and hence

$$\varphi_c(\varepsilon/2) = -\log P\Big(\sup_{0 \le t \le 1} |W_{t/c}| \le \varepsilon/2\Big) \ge -\log P\Big(\sup_{0 \le t \le 1} |W_t| \le \varepsilon/2\Big).$$

Since the right-hand side is independent of c, it follows that (2.4) holds for all
$\varepsilon$ in an interval independent of c. The preceding lemma and Theorem 2 of [8]
imply that for $\varepsilon$ small enough

$$\varphi_c(2\varepsilon) \lesssim \frac{1}{c}\Big(\log\frac{\sqrt{\varphi_c(\varepsilon)}}{\varepsilon}\Big)^2.$$

Again, inspection of the proof of the cited result of [8] shows that under our
assumption $c < 1$, this holds for all $\varepsilon > 0$ in an interval independent of c.
Combination of the preceding display with (2.4) now yields the statement of
the theorem. □

2.2. Multiply integrated Brownian motion 

Consider a mean-zero Gaussian process $(W_t : t \ge 0)$ that is self-similar of order
$\alpha$: the processes $(c^\alpha W_{t/c} : t \ge 0)$ and $(W_t : t \ge 0)$ are equal in distribution for
every $c > 0$. The rescaled process $(W_{t/c} : 0 \le t \le 1)$ given in (1.1) is then equal
in distribution to the process $(c^{-\alpha}W_t : 0 \le t \le 1)$, which means that for use
as a prior distribution the rescaling of the time-axis is equivalent to a rescaling
of the vertical axis. The rescaling has a simple effect on the reproducing kernel
Hilbert space and small ball probability, but it has an interesting consequence.

We assume that the restriction $(W_t : 0 \le t \le 1)$ of the process to the unit
interval and the rescaled process $W^c = (W_{t/c} : 0 \le t \le 1)$ can be viewed as
Borel measurable maps in a separable Banach space $(\mathbb{B}, \|\cdot\|)$, and that the
self-similarity can be understood in the sense that the Borel laws of these two
processes are identical. The RKHS and small ball probability of the process
$W^c = (W_{t/c} : 0 \le t \le 1)$ are then equal to these objects for the process
$(c^{-\alpha}W_t : 0 \le t \le 1)$. Let $\mathbb{H}_W$ be the RKHS of the process W (restricted to
the unit interval) and let $\varphi_0(\varepsilon; W) = -\log\Pr(\|W\| < \varepsilon)$ be the exponent in its
centered small ball probability. The following lemma is clear from the preceding.

Lemma 2.5. The RKHS of the process $W^c$ is the set of functions $\mathbb{H}_W$ equipped
with the norm $\|h\|_{W^c} = c^\alpha \|h\|_W$. The centered small ball exponent of $W^c$
satisfies $-\log\Pr(\|W^c\| < \varepsilon) = \varphi_0(c^\alpha\varepsilon; W)$.
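
The equivalence between rescaling the time axis and rescaling the vertical axis can be checked on covariances in the simplest self-similar case. The sketch below uses standard Brownian motion (self-similar of order 1/2, covariance $\min(s,t)$) as an assumed example; the function names are illustrative.

```python
def bm_cov(s: float, t: float) -> float:
    """Covariance of standard Brownian motion: E B_s B_t = min(s, t)."""
    return min(s, t)

def time_rescaled_cov(s: float, t: float, c: float) -> float:
    """Covariance of the time-rescaled process (B_{t/c})."""
    return bm_cov(s / c, t / c)

def space_rescaled_cov(s: float, t: float, c: float, alpha: float = 0.5) -> float:
    """Covariance of c^{-alpha} B_t; alpha = 1/2 is the self-similarity index of B."""
    return c ** (-2.0 * alpha) * bm_cov(s, t)

if __name__ == "__main__":
    for c in (0.25, 2.0):
        print(c, time_rescaled_cov(0.3, 0.7, c), space_rescaled_cov(0.3, 0.7, c))
```

Because centered Gaussian processes are determined by their covariance functions, equal covariances imply equal laws, which is the fact that Lemma 2.5 relies on.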

As an example consider the k-fold integrated Brownian motion. Define $I^1_{0+}f$
as the function $t \mapsto \int_0^t f(s) \, ds$ and set $I^k_{0+}f = I^1_{0+}(I^{k-1}_{0+}f)$. Because Brownian
motion B is self-similar of index 1/2, the process $W = I^k_{0+}B$ is self-similar of
order $\alpha = k + 1/2$. We consider the restriction of this process to [0, 1] as a map
in C[0, 1].

The fact that the integrated Brownian motion has k derivatives at 0 equal
to zero causes that the functions in its reproducing kernel Hilbert space satisfy
similar constraints at 0. A better prior is obtained by adding an independent
polynomial to the process. We consider the modified process

$$V^{c,a}_t = (I^k_{0+}B)_{t/c} + a \sum_{i=0}^{k} Z_i \frac{t^i}{i!} \tag{2.5}$$

for scaling factors c, a > 0, B a standard Brownian motion and independent
standard normal variables $Z_0, \ldots, Z_k$, independent of B.

The following theorem gives a centered small deviation bound for the process
$V^{c,a}$ and describes the approximation of smooth functions by elements of its
RKHS $\mathbb{H}^{c,a}$.

Theorem 2.6. Consider the process $V^{c,a}$ given in (2.5) as a map in C[0, 1].
This process satisfies, for $\varepsilon > 0$ small enough,

$$-\log\Pr(\|V^{c,a}\|_\infty < \varepsilon) \lesssim \Big(\frac{1}{c^{k+1/2}\varepsilon}\Big)^{1/(k+1/2)} + \log\frac{1}{a\varepsilon}.$$

Moreover, for $w \in C^\beta[0,1]$ and $\beta \le k + 1$,

$$\inf\{\|h\|_{c,a}^2 : \|h - w\|_\infty \le \varepsilon\} \lesssim c^{2k+1}\Big(\frac{1}{\varepsilon}\Big)^{(2k+2-2\beta)/\beta} + \frac{1}{a^2}\Big(\frac{1}{\varepsilon}\Big)^{((2k-2\beta)/\beta) \vee 0}.$$

Proof. The assertion on the small ball probability follows easily from Theorem
2.1 of Li and Linde [13] on the small ball probability of integrated Brownian
motion, and the fact that the added polynomial is independent and
finite-dimensional.

By general arguments (e.g. [21], Section 10) we have that the reproducing kernel
Hilbert space of the process (2.5) viewed as a map in C[0, 1] is the Sobolev
space $H^{k+1}[0,1]$ of functions $h : [0,1] \to \mathbb{R}$ that are k times continuously
differentiable with absolutely continuous kth derivative that is the integral of a
function $h^{(k+1)} \in L_2[0,1]$, equipped with the norm with square

$$\|h\|_{c,a}^2 = c^{2k+1} \int_0^1 h^{(k+1)}(t)^2 \, dt + \frac{1}{a^2} \sum_{i=0}^{k} h^{(i)}(0)^2.$$

For a smooth function $\phi$ and $\phi_\sigma(x) = \phi(x/\sigma)/\sigma$ its scaled version, the
convolution $w * \phi_\sigma$ is contained in the RKHS, with square norm

$$\|w * \phi_\sigma\|_{c,a}^2 = c^{2k+1} \int_0^1 (w * \phi_\sigma)^{(k+1)}(t)^2 \, dt + \frac{1}{a^2} \sum_{i=0}^{k} (w * \phi_\sigma)^{(i)}(0)^2.$$

If $\phi$ is chosen such that $\int \phi(x) \, dx = 1$ and with zero moments of orders $1, \ldots, k$,
then the distance $\|w * \phi_\sigma - w\|_\infty$ in the uniform norm can be seen to be of order
$\sigma^\beta$. We choose $\sigma = \varepsilon^{1/\beta}$ and next evaluate the preceding display to be of the
order as given in the theorem. □

3. Posterior contraction rates 

In this section we present the main results on posterior convergence rates using 
rescaled Gaussian process priors. We denote the posterior distribution based on
a prior $\Pi_n$ and observations $X^{(n)}$ by $B \mapsto \Pi_n(B \mid X^{(n)})$.

We consider three different statistical settings: i.i.d. density estimation, classification,
and fixed design regression. For any of these, the general theory developed
in [20] gives results expressing posterior contraction rates in terms of the
small deviations behaviour and the RKHS structure of the Gaussian prior. The 
results in the present section are obtained by combining these general results 
with the material of the preceding section. 

Below we give complete proofs for the density estimation case. Since the other 
two cases are completely analogous we only explain the results briefly. 

3.1. Density estimation 

Suppose that we observe a random sample $X_1, \ldots, X_n$ from a positive density
$p_0$ on [0, 1]. A prior distribution on the set of positive densities can be defined
structurally as $p_W$, for a Gaussian process $W = (W_t : t \in [0,1])$ and the
function $p_w$ defined by

$$p_w(t) = \frac{e^{w_t}}{\int_0^1 e^{w_s} \, ds}. \tag{3.1}$$

A.W. van der Vaart and J.H. van Zanten/Rescaled Gaussian process priors 445 



In the next two theorems we assume that the true density is $\alpha$-smooth in the
sense that $\log p_0 \in C^\alpha[0,1]$. We show that if in this case we take for W a
suitably rescaled Gaussian process, we obtain a posterior that (perhaps up to
logarithmic factors) contracts around the true density at the optimal minimax
rate $n^{-\alpha/(1+2\alpha)}$.
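
The structural prior maps a sample path w to the density obtained by exponentiating and renormalizing, $p_w(t) = e^{w_t}/\int_0^1 e^{w_s}\,ds$. A minimal sketch of this map on an equispaced grid follows; the trapezoidal quadrature and the name `density_from_path` are illustrative assumptions, not part of the paper.

```python
import math

def density_from_path(w: list) -> list:
    """Map path values w_0..w_m on an equispaced grid of [0, 1] to p_w = exp(w) / int exp(w)."""
    m = len(w) - 1
    h = 1.0 / m
    shift = max(w)  # subtracting the maximum leaves p_w unchanged and avoids overflow
    e = [math.exp(v - shift) for v in w]
    integral = h * (sum(e) - 0.5 * (e[0] + e[-1]))  # trapezoidal rule on [0, 1]
    return [v / integral for v in e]

if __name__ == "__main__":
    grid = [i / 200 for i in range(201)]
    path = [math.sin(6 * t) for t in grid]  # stand-in for a Gaussian process sample path
    p = density_from_path(path)
    print(min(p), max(p))
```

The map is invariant under adding a constant to w, so only the shape of the path matters for the induced density.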

The first result deals with rescaled smooth stationary processes. 

Theorem 3.1. Let $\alpha > 0$ be fixed. Let $W = (W_t : t \ge 0)$ be a centered,
stationary Gaussian process with spectral measure $\mu$ satisfying condition (2.2)
for some $\delta > 0$, and possessing a positive Lebesgue density. Let $W^n = (W_{t/c_n} :
t \in [0,1])$ be the rescaled version of W, for scaling constants $c_n \to 0$. Define the
prior $\Pi_n$ structurally as $p_{W^n}$, with $p_w$ as in (3.1). Then if $\log p_0 \in C^\alpha[0,1]$, we
have

$$E_0\Pi_n(p : h(p, p_0) > M\varepsilon_n \mid X_1, \ldots, X_n) \to 0$$

for all M large enough, where $\varepsilon_n = c_n^\alpha \vee (\log n)/\sqrt{nc_n}$ and h is the Hellinger
distance on densities. For

$$c_n = \Big(\frac{\log^2 n}{n}\Big)^{1/(2\alpha+1)}$$

this gives the rate $\varepsilon_n = (n/\log^2 n)^{-\alpha/(1+2\alpha)}$.

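Proof. By Theorem 3.1 of [20] we get the conclusion of the theorem as soon as
we show that $\varphi_{w_0}(\varepsilon_n) \le n\varepsilon_n^2$, where, with $w_0 = \log p_0$,

$$\varphi_{w_0}(\varepsilon_n) = \inf_{h \in \mathbb{H}^n :\, \|h - w_0\|_\infty \le \varepsilon_n} \|h\|_{\mathbb{H}^n}^2 - \log\Pr(\|W^n\|_\infty \le \varepsilon_n),$$

with $\mathbb{H}^n$ the RKHS of the rescaled process $W^n$. Hence, by Lemma 2.2 and
Theorem 2.4 it suffices to verify that

$$\frac{1}{c_n}\Big(\log\frac{1}{c_n\varepsilon_n}\Big)^2 \lesssim n\varepsilon_n^2, \qquad c_n^\alpha \lesssim \varepsilon_n \qquad \text{and} \qquad \frac{1}{c_n} \lesssim n\varepsilon_n^2.$$

It is easy to check that these relations indeed hold for $c_n$ and $\varepsilon_n$ as in the
statement of the theorem. □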
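
The three relations in the proof can be verified numerically for the optimal scaling. The sketch below assumes $c_n = (\log^2 n/n)^{1/(2\alpha+1)}$ and $\varepsilon_n = c_n^\alpha$ and reports each left-hand side divided by its right-hand side; a ratio that stays bounded in n corresponds to $\lesssim$. The function name is hypothetical.

```python
import math

def condition_ratios(n: int, alpha: float) -> tuple:
    """Ratios (small ball, approximation error, RKHS norm) of left-hand side to right-hand side."""
    c = (math.log(n) ** 2 / n) ** (1.0 / (2.0 * alpha + 1.0))
    eps = c ** alpha
    target = n * eps ** 2
    small_ball = (1.0 / c) * math.log(1.0 / (c * eps)) ** 2
    return small_ball / target, c ** alpha / eps, (1.0 / c) / target

if __name__ == "__main__":
    for n in (10**3, 10**6, 10**9):
        print(n, condition_ratios(n, 1.0))
```

With these choices $n\varepsilon_n^2 c_n = \log^2 n$, so the first ratio equals $(\log(1/(c_n\varepsilon_n))/\log n)^2$ and the third equals $1/\log^2 n$; both stay below one for $n \ge 3$.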

The following theorem gives the analogous result for rescaled integrated 
Brownian motions. 

Theorem 3.2. For $\alpha > 0$ and $k \in \mathbb{N}_0$, let $V^n$ be the modified k-fold integrated
Brownian motion defined in (2.5), with the scaling constant c replaced by

$$c_n = n^{(\alpha - k - 1/2)/((k+1/2)(1+2\alpha))}$$

and a replaced by a sequence $a_n$ satisfying $a_n^{-2} \lesssim n^{(1+2\alpha-2k)/(1+2\alpha)}$. Define the prior $\Pi_n$
structurally as $p_{V^n}$, with $p_w$ as in (3.1). Then if $\log p_0 \in C^\alpha[0,1]$ and $\alpha \le k + 1$,
we have

$$E_0\Pi_n(p : h(p, p_0) > M\varepsilon_n \mid X_1, \ldots, X_n) \to 0$$

for all M large enough, where $\varepsilon_n = n^{-\alpha/(1+2\alpha)}$ and h is the Hellinger distance on
densities.
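
Proof. By Theorem 3.1 of [20] and Theorem 2.6 it suffices to verify

$$\Big(\frac{1}{c_n^{k+1/2}\varepsilon_n}\Big)^{1/(k+1/2)} + \log\frac{1}{a_n\varepsilon_n} \lesssim n\varepsilon_n^2$$

and

$$c_n^{2k+1}\Big(\frac{1}{\varepsilon_n}\Big)^{(2k+2-2\alpha)/\alpha} + \frac{1}{a_n^2}\Big(\frac{1}{\varepsilon_n}\Big)^{((2k-2\alpha)/\alpha) \vee 0} \lesssim n\varepsilon_n^2.$$

For $c_n$, $\varepsilon_n$ and $a_n$ as in the theorem the left-hand sides of the displays are
dominated by the first terms. Hence, it remains to check that

$$\Big(\frac{1}{c_n^{k+1/2}\varepsilon_n}\Big)^{1/(k+1/2)} \lesssim n\varepsilon_n^2$$

and

$$c_n^{2k+1}\Big(\frac{1}{\varepsilon_n}\Big)^{(2k+2-2\alpha)/\alpha} \lesssim n\varepsilon_n^2,$$

which is straightforward. □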
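
The "straightforward" check amounts to comparing exponents of n, which can be done in exact rational arithmetic. The sketch below assumes $c_n = n^{(\alpha-k-1/2)/((k+1/2)(1+2\alpha))}$ and $\varepsilon_n = n^{-\alpha/(1+2\alpha)}$; the function names are hypothetical.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def exponents(alpha, k: int) -> tuple:
    """Exponents of n in c_n, eps_n and n * eps_n^2 for the assumed choices."""
    alpha = Fraction(alpha)
    c_exp = (alpha - k - HALF) / ((k + HALF) * (1 + 2 * alpha))
    eps_exp = -alpha / (1 + 2 * alpha)
    return c_exp, eps_exp, 1 + 2 * eps_exp

def small_ball_exponent(alpha, k: int) -> Fraction:
    """Exponent of n in (1 / (c_n^{k+1/2} eps_n))^{1/(k+1/2)}."""
    c_exp, eps_exp, _ = exponents(alpha, k)
    return -((k + HALF) * c_exp + eps_exp) / (k + HALF)

def rkhs_exponent(alpha, k: int) -> Fraction:
    """Exponent of n in c_n^{2k+1} (1 / eps_n)^{(2k+2-2 alpha)/alpha}."""
    alpha = Fraction(alpha)
    c_exp, eps_exp, _ = exponents(alpha, k)
    return (2 * k + 1) * c_exp - eps_exp * (2 * k + 2 - 2 * alpha) / alpha

if __name__ == "__main__":
    a, k = Fraction(3, 2), 2
    print(exponents(a, k)[2], small_ball_exponent(a, k), rkhs_exponent(a, k))
```

Both leading terms carry exactly the exponent $1/(1+2\alpha)$ of $n\varepsilon_n^2$, so the two relations hold with equality up to constants.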



3.2. Fixed design regression 

Suppose that we observe independent variables $Y_1, \ldots, Y_n$ following the regression
model $Y_i = w_0(t_i) + e_i$ for unobservable $N(0, \sigma_0^2)$-distributed errors $e_i$ and
fixed, known elements $t_1, \ldots, t_n$ of the unit interval. Consider estimating the
regression function $w_0$.

As a prior on w we use the Gaussian processes $W^n$ from Theorem 3.1 or $V^n$
from Theorem 3.2. If the standard deviation $\sigma_0$ of the errors $e_i$ is not known, we also put a
prior on $\sigma_0$, which we assume to be supported on a given interval $[a, b] \subset (0, \infty)$
with a Lebesgue density that is bounded away from zero.

A combination of Theorem 3.3 of [20] and the results of Section 2 then shows
that if $w_0 \in C^\alpha[0,1]$ and $\sigma_0 \in [a, b]$ the analogues of the statements of the
theorems for the density estimation case are true in this setting as well. We get
the same rates of posterior contraction, and the statement of the theorems has
to be replaced by

$$E_0\Pi_n\big((w, \sigma) : \|w - w_0\|_n + |\sigma - \sigma_0| > M\varepsilon_n \mid Y_1, \ldots, Y_n\big) \to 0$$

for all M large enough, where $\|f\|_n^2 = n^{-1}\sum_{i=1}^n f^2(t_i)$.
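
The empirical norm used to measure contraction in the regression setting is straightforward to compute. The following sketch assumes $\|f\|_n^2 = n^{-1}\sum_{i=1}^n f^2(t_i)$ over the fixed design points; the helper names are illustrative.

```python
import math

def empirical_norm(f, design: list) -> float:
    """||f||_n = sqrt(n^{-1} * sum_i f(t_i)^2) over the fixed design points t_1..t_n."""
    n = len(design)
    return math.sqrt(sum(f(t) ** 2 for t in design) / n)

def regression_loss(w, w0, sigma: float, sigma0: float, design: list) -> float:
    """The quantity ||w - w0||_n + |sigma - sigma0| appearing in the posterior statement."""
    return empirical_norm(lambda t: w(t) - w0(t), design) + abs(sigma - sigma0)

if __name__ == "__main__":
    design = [i / 10 for i in range(1, 11)]
    print(regression_loss(lambda t: t + 0.5, lambda t: t, 1.2, 1.0, design))
```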



3.3. Classification

Suppose that we observe a random sample of vectors $(X_1, Y_1), \ldots, (X_n, Y_n)$ from
the distribution of (X, Y), where Y takes its values in the set {0, 1} and X takes
its values in the unit interval. Consider estimating the binary regression function
$f_0(t) = \Pr(Y = 1 \mid X = t)$.

We construct a prior on the set of regression functions as $f_{W^n}$ (or $f_{V^n}$) for
$W^n$ (or $V^n$) the Gaussian process from Theorem 3.1 (or 3.2) and the function
$f_w(t) = \Psi(w_t)$, where $\Psi : \mathbb{R} \to (0, 1)$ is (for instance) the logistic distribution
function.

Theorem 3.2 of [20] and the results of Section 2 imply that if $\Psi^{-1}(f_0) \in
C^\alpha[0,1]$, the analogues of the statements of Theorems 3.1 and 3.2 hold in this
setting. We get the same rates of posterior contraction in this case as well; the
statement of the theorems has to be replaced by

$$E_0\Pi_n\big(f : \|f - f_0\|_{G,2} > M\varepsilon_n \mid (X_1, Y_1), \ldots, (X_n, Y_n)\big) \to 0$$

for any sufficiently large constant M, where $\|f\|_{G,2}^2 = \int f^2(t) \, dG(t)$ and G is the
marginal distribution of X.
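
The link between a real-valued path and a binary regression function can be sketched as follows, assuming the logistic choice of $\Psi$ mentioned above; the function names are illustrative and the numerically stable branching is an implementation choice.

```python
import math

def logistic(x: float) -> float:
    """Logistic distribution function Psi(x) = 1 / (1 + exp(-x)), computed without overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def logit(p: float) -> float:
    """Inverse link Psi^{-1}(p) = log(p / (1 - p)) for p in (0, 1)."""
    return math.log(p / (1.0 - p))

def regression_function(w: list) -> list:
    """Apply the link pointwise: f_w(t) = Psi(w_t) for path values on a grid."""
    return [logistic(v) for v in w]

if __name__ == "__main__":
    print(regression_function([-2.0, 0.0, 2.0]))
```

The smoothness assumption of the section is placed on $\Psi^{-1}(f_0)$, which is the quantity `logit` recovers from a regression function.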